Corpus Refactoring: a Feasibility Study

Authors

  • Helen L Johnson
  • William A Baumgartner
  • Martin Krallinger
  • K Bretonnel Cohen
  • Lawrence Hunter

Abstract

BACKGROUND Most biomedical corpora have not been used outside of the lab that created them, even though the availability of the gold-standard evaluation data they provide is one of the rate-limiting factors for progress in biomedical text mining. Data suggest that one major factor affecting the use of a corpus outside of its home laboratory is the format in which it is distributed. This paper tests the hypothesis that corpus refactoring (changing the format of a corpus without altering its semantics) is a feasible goal, namely that it can be accomplished with a semi-automatable process and in a time-efficient way. We used simple text processing methods and limited human validation to convert the Protein Design Group corpus into two new formats: WordFreak and embedded XML. We tracked the total time expended and the success rates of the automated steps.

RESULTS The refactored corpus is available for download at the BioNLP SourceForge website, http://bionlp.sourceforge.net. The total time expended was just over three person-weeks, consisting of about 102 hours of programming time (much of which is one-time development cost) and 20 hours of manual validation of automatic outputs. Additionally, the steps required to refactor any corpus are presented.

CONCLUSION We conclude that refactoring of publicly available corpora is a technically and economically feasible method for increasing the usage of data already available for evaluating biomedical language processing systems.
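To make the idea of changing a corpus's format without altering its semantics concrete, the following is a minimal Python sketch of one such conversion: turning standoff character-offset annotations into embedded XML, then checking automatically that the original text survives the round trip. Everything in it is an assumption made for illustration: the `(start, end, label)` input format, the `sentence` and `entity` tag names, and the function names `embed_annotations` and `text_is_preserved` are hypothetical. It is not the Protein Design Group corpus format, and it is not the authors' code.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def embed_annotations(text, spans):
    """Convert standoff (start, end, label) spans into embedded XML.

    Hypothetical format: spans are character offsets into `text` and
    must not overlap; a ValueError is raised otherwise.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor or end > len(text) or start >= end:
            raise ValueError(f"bad or overlapping span: {(start, end, label)}")
        # Unannotated text between the previous span and this one.
        pieces.append(escape(text[cursor:start]))
        # The annotated text, wrapped in an inline tag.
        pieces.append(
            f"<entity type={quoteattr(label)}>{escape(text[start:end])}</entity>"
        )
        cursor = end
    pieces.append(escape(text[cursor:]))
    return "<sentence>" + "".join(pieces) + "</sentence>"


def text_is_preserved(text, xml_string):
    """Automated check: stripping the tags must give back the original text."""
    root = ET.fromstring(xml_string)
    return "".join(root.itertext()) == text


if __name__ == "__main__":
    sentence = "RAD51 binds BRCA2 & more."
    annotations = [(0, 5, "protein"), (12, 17, "protein")]
    xml_out = embed_annotations(sentence, annotations)
    print(xml_out)
    print(text_is_preserved(sentence, xml_out))
```

A check like `text_is_preserved` only covers the text itself. Whether the annotations still mean the same thing in the new format would presumably still need the kind of limited human validation the abstract mentions.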


Similar resources

Using Continuous Change Analysis to Understand the Practice of Refactoring

Despite the enormous success that manual and automated refactoring has enjoyed during the last decade, we know little about the practice of refactoring. Understanding the refactoring practice is important for developers, refactoring tool builders, and researchers. Many previous approaches to study refactorings are based on comparing code snapshots, which is imprecise, incomplete, and does not a...

Using Continuous Code Change Analysis to Understand the Practice of Refactoring

Despite the enormous success that manual and automated refactoring has enjoyed during the last decade, we know little about the practice of refactoring. Understanding the refactoring practice is important for developers, refactoring tool builders, and researchers. Many previous approaches to study refactorings are based on comparing code snapshots, which is imprecise, incomplete, and does not a...

A Comparative Study of Manual and Automated Refactorings

Despite the enormous success that manual and automated refactoring has enjoyed during the last decade, we know little about the practice of refactoring. Understanding the refactoring practice is important for developers, refactoring tool builders, and researchers. Many previous approaches to study refactorings are based on comparing code snapshots, which is imprecise, incomplete, and does not a...

Refactoring Corpora

We describe a pilot project in semiautomatically refactoring a biomedical corpus. The total time expended was just over three person-weeks, suggesting that this is a cost-efficient process. The refactored corpus is available for download at http://bionlp.sourceforge.net.

A Meta-model for Language-Independent Refactoring

Refactoring —transforming code while preserving behaviour— is currently considered a key approach for improving object-oriented software systems. Unfortunately, all of the current refactoring tools depend on language-dependent refactoring engines, which prevents a smooth integration with mainstream development environments. In this paper we investigate the similarities between refactorings for ...

Journal:
  • Journal of Biomedical Discovery and Collaboration

Volume: 2

Publication date: 2007